Normalized z-score for genomic association tests
Authors: Roberto Malinverni and David Corujo Gracia
Introduction
In the R package regioneReloaded we implement the use of a “normalized z-score” (nZS) as a useful parameter to reduce the z-score (ZS) value dependency on the number of regions included in the context of a permutation test using genomic regions. The nZS is defined as the ZS divided by the square root of the number of elements in a region set.
In this script, we showcase a series of examples to test and empirically demonstrate the relationship between the ZS and nZS values when performing permutation tests. First, we simplify the problem to a minimal probabilistic scenario. Then, we report the results of anlaysis run on publicly available ChIP-Seq datasets from the ENCODE project.
Simulated samples
The use of the nZS was originally conceptualized and tested in the context of evaluating the number of overlaps between genomic regions, although we test its usability with other evaluation methods later in this document.
An overlap between two genomic regions can be defined by different criteria, but eventually comes down to a binary event: either two regions are considered to overlap or not. We model this probabilistic behavior by reducing the concept of region sets to numerical vectors. In essence, we can think of these numbers as coordinates on the genome and the regions reduced to single points. If the same number is present in two vectors, that represents one “overlap”.
Following this reasoning, we create 5 vectors (Figure 1) containing a set of numbers (not replicated), all of them chosen from a “coordinate universe” of 1 to 100000 with the following rules:
Tester - 1000 random numbers chosen from our universe
test1 - 250 numbers: 50 present in Tester and 200 random
test1_bis - 1050 numbers: 50 present in Tester and 1000 random
test1_ter - 3000 numbers: 1000 present in Tester and 2000 random
test2 - 500 random numbers
test3 - 500 random numbers not present in Tester
testX samples will be associated with Tester using permutation tests to simulate divergent cases of association.
Simple permutation test
The testing approach presented here performs a permutation test between one of the above described vectors and the Tester vector to determine if the observed “overlap” is significantly different from what we would “expect” randomly. We randomize 5000 times the query vector by sampling the same number of elements from the universe, and record each time the number of overlaps with Tester. We then take the initial observed overlap and compare it to the random distribution, obtaining a Z-score value and an associated p-value. In these examples, we will focus exclusively on the z-score and the normalized z-score.
To assess the variability of the z-score depending on the size of the query set used, we repeat the permutation test several times using increasing sub-fractions of the sample.
Below, we show the observed overlap between each vector and our Tester, as well as the ZS and nZS values obtained using a different fraction of each vector. While the absolute value of the ZS for significant associations greatly increases with bigger sample sizes, the nZS remains much more stable and in a more comparable range between different tests.
Reproduciblity and stability of ZS and nZS values
Given the stochastic nature of the permutation test, we repeat the procedure described above 100 times for each comparison, with 5000 randomization steps each time. This approach allows us to evaluate the robustness of the results obtained in terms of obtained ZS and nZS values.
Below, boxplot representations of the 100 results obtained for each test and subfraction of the sample. We observe that results are highly reproducible and, in particular, that the nZS shows a very good stability.
Conclusion
From the given examples, we can deduce that the normalized z-score tends to be stable, while the regular z-score seems to depend on the number of items considered. When assessing a library of biological data, we cannot always guarantee that the elements we identify cover the entirety of the elements present. The normalized z-score is useful in portraying the degree of association in such scenarios. However, the normalized z-score should be employed solely when the p-value linked to the permutation test (and hence the z-score) is statistically significant. As seen in Figure 5 and Figure 10, both the z-score and the normalized z-score are notably variable, yet they oscillate around a value of 0. With default parameters, the functions in ‘regioneReloaded’ will replace the value of the normalized z-score with 0 in this context. The threshold p-value and substition value for this action can be adjusted through the function’s options.
Test with biological datasets
To test the behaviour of the permutation test and its associated z-zscore and normalized z-score, we make use of publicly available ChIP-Seq datasets generated and available through the ENCODE project. In particular, we will use the called peaks as annotated by the ENCODE pipeline as genomic region sets to input into regioneReloaded.
In particular, we test the association of H3K4me3 peaks with subunits of the RNA Polymerase II complex (POLR2A and POLR2G) and the histone post-translation modifications H3K27Ac and H3K9me3. H3K4me3 is a histone modification strongly enriched at the promoter region of actively transcribed genes. Hence, we expect an association with POLR2A and POLR2G which form part of the transcriptional machinery. H3K27Ac is another histone mark generally enriched at active promoters and enhancer elements, hence we also expect a positive association with H3K4me3. On the other hand, H3K9me3 is a histone mark present mostly at heterochromatic elements that are transcriptionally silenced, which means it will rarely coincide with the above mentioned region sets. The following results were obtained by using the indicated randomization functions, 5000 permutations and numOverlaps as an evaluation function with count.once set to TRUE.
As mentioned above, we originally introduced the nZS in the context of using the overlap between genomic regions as an evaluation, but it is in principle also usable with other evaluation functions. Here we perform the same tests as shown above but using the mean distance between regions as an evaluation function, and observe a similar effect of the nZS being more stable versus different region set sizes. Note that in this case, associatoin that had positive z-score values with number of overlaps now have negative values, since the regions are closer (the distance is smaller between them) than the randomized distribution.